Extreme deconvolution: inferring complete distribution functions 
from noisy, heterogeneous and incomplete observations 
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ABSTRACT 

We generalize the well-known mixtures of Gaussians approach to density es- 
timation and the accompanying Expectation-Maximization technique for finding 
the maximum likelihood parameters of the mixture to the case where each data 
point carries an individual ci-dimensional uncertainty covariance and has unique 
missing data properties. This algorithm reconstructs the error- deconvolved or 
"underlying" distribution function common to all samples, even when the indi- 
vidual data points are samples from different distributions, obtained by convolv- 
ing the underlying distribution with the unique uncertainty distribution of the 
data point and projecting out the missing data directions. We show how this 
basic algorithm can be extended with Bayesian priors on all of the model param- 
eters and a "split-and-merge" procedure designed to avoid local maxima of the 
likelihood. We apply this technique to a few typical astrophysical applications. 



1. Introduction 

Inferring a distribution function given a finite set of samples from this distribution func- 
tion is a problem of considerable general interest. The literature contains many density esti- 
mation techniques, ranging from simply binning the data in a histogram and smoother ver- 
sions of this collectively known as kernel density estimation, to more sophisticated techniques 
such as non-parametric (penalized) maximum likelihood fitting (see, e.g., Silverman 1986, for 
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a review of all of the previous methods) and full Bayesian analyses (e.g., Diebolt and Robert 
1994; Richardson and Green 1997). These techniques perform well in the high signal-to- 
noise regime, i.e., when one has "good data", however, in scientific applications the data 
generally come with large and heterogeneous uncertainties and one often only has access to 
lower dimensional projections of the full data, i.e., there are often missing data. As a sci- 
entist, you are not interested in the observed distribution, what you really want to know is 
the underlying distribution, i.e., the distribution you would have if the data had vanishingly 
small uncertainties and no missing data. In this paper we describe a general approach for 
inferring distribution functions when these complications are present. 

A frequently used density estimation technique is to model the distribution function as 
a sum of Gaussian distributions by optimizing the likelihood of this model given the data 
(e.g. McLachlan and Basford 1988). We show that this approach can be generalized in the 
presence of noisy, heterogeneous, and incomplete data. The likelihood of the model for each 
data point is given by the model convolved with the (unique) uncertainty distribution of that 
data point; the objective function is obtained by simply multiplying these individual likeli- 
hoods together for the various data points. Optimizing this objective function one obtains 
a maximum likelihood estimate of the distribution (more specifically, of its parameters). 

While optimization of this objective function can, in principle, be performed by a generic 
optimizer, we develop an Expectation-Maximization (EM) algorithm that optimizes the ob- 
jective function. This algorithm works in much the same way as the normal EM algorithm for 
mixture of Gaussians density estimation, except that an additional degree of incompleteness 
is given by the actual values of the observables, since we only have access to noisy projections 
of the actual observables; in the expectation step these actual values are estimated based 
on the noisy and projected measured values and the current estimate of the distribution 
function. In the limit in which the noise is absent but the data are lower dimensional pro- 
jections of the quantities of interest, this algorithm reduces to the algorithm described in 
Ghahramani and Jordan (1994a,b). 

We also show how Bayesian priors on all of the parameters of the model can be natu- 
rally included in this algorithm as well as how a split-and-merge procedure that heuristically 
searches parameter space for better approximations to the global maximum can also be 
incorporated in this approach. These priors and the split-and-merge procedure can be im- 
portant when applying the EM algorithm developed here in situations with real data where 
the likelihood surface can have a very complicated structure. We also discuss briefly the 
practical issues having to do with model selection in the mixture model approach. 

Applications to real data sets are discussed in detail in Section 6, both in general terms, 
i.e., why the approach we put forward here is more appropriate when dealing with noisy. 
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heterogeneous data than the more traditional density estimation techniques, as well as in 
some concrete examples. These concrete examples show that the technique developed in 
this paper performs extremely well even when the underlying distribution function has a 
complicated structure. 

The technique we describe below has many applications besides returning a maximum 
likelihood fit to the error-deconvolved distribution function of a data sample. For instance, 
when an estimate of the uncertainty in the estimated parameters or distribution function 
is desired or when a full Bayesian analysis of the mixture model preferred, the outcome 
of the maximum likelihood technique developed here can be used as a seed for Markov 
Chain Monte Carlo (MCMC) methods for finite mixture modeling (e.g., Diebolt and Robert 
1994; Richardson and Green 1997). Another possible application concerns fitting a linear 
relationship to a data set {(xj, t/i)} when the data has non-negligible errors both in x as well 
as in y, which can be correlated. This problem can be thought of as fitting the underlying, 
error-deconvolved distribution of the points with a Gaussian; the linear relationship 

then corresponds to the direction of the largest eigenvalue of the underlying distribution's 
covariance matrix. We describe this application in a specific case in section 6.2. 

2. Likelihood of a mixture of Gaussian distributions given a set of 



Our goal is to fit a model for the distribution of a rf-dimensional quantity v using a 
set of observational data points Wj. Therefore, we need to write down the probability of 
the data under the model for the distribution. The observations are assumed to be noisy 
projections of the true values Vj 



where the noise is drawn from a Gaussian with zero mean and known covariance tensor 
Sj. The case in which there is missing data occurs when the projection matrix Rj is rank- 
deficient. Alternatively, we can handle the missing data case by describing the missing data 
as directions of the covariance matrix that have a formally infinite eigenvalue; In practice 
we use very large eigenvalues in the noise-matrix. When the data has only a small degree 
of incompleteness, i.e., when each data point has only a small number of unmeasured di- 
mensions, this latter approach is often the best choice, since one often has some idea about 
the unmeasured values. For example, in the example given below of inferring the velocity 
distribution of stars near the Sun we know that the stars are moving at velocities that do 
not exceed the speed of light, which is not very helpful, but also that none of the velocities 



heterogeneous, noisy samples 



Wj = RjVj + noise , 
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exceed the local Galactic escape speed, since we can safely assume that all the stars are 
bound to the Galaxy. However, in situations in which each data point has observations of 
a dimensionality -C d using the projections matrices will greatly reduce the computational 
cost, since, as will become clear below, the most computationally expensive operations all 
take place in the lower dimensional space of the observations. 

We will model the distribution p(v) of the true values v as a mixture of K Gaussians: 



where the amplitudes aj sum to unity and the function A/'(v|m, V) is the Gaussian distri- 
bution with mean m and variance tensor V: 



For a given observation Wj the likelihood of the model parameters 9 = [aj, nij, V, ) given 
that observation and the noise covariance Sj, which we will write as p{wi\9), can be written 
as: 



K 




(2) 



A/'(v|m, 




(3) 





where 



p(w,|v) 
P(v|j,^) 

Pirn 



Ar(w,|R,v, S,) 
A/'(v|m,-,V,) 



(5) 



This likelihood works out to be itself a mixture of Gaussians — in order not to clutter the 
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algebra we set all of the projection matrices equal to the unit matrix here- 
p{wi\e) = ^a,- /"dvAr(w,|v,S,;)Ar(v|m,-,V,) 

= 5^a,- f dv(27r)-'^det(Si)-^/Met(V,)-i/2 

J V 



exp 



-- {(w, - v)^S-^(w, - v) + (v - m,)^V,-i(v - m,)} 



^a,- f dv(27r)-'^det(Si)-^/Met(V,)- 

J V 



1/2 



exp 



-{v-(Vri + S-V-v-(Vrim, + Sri 



W,; 



This integral works out to be 
P(w.l^) = 5^«,(27r) 



,;, det(S.)-VMet(V,)-V^ 
det(Sr^+Vri)i/2 



exp 



-mlVr^: 



3-1 



sr w,; - Wi sr 1 _ 1 ^ riij } 



1 , c-l'^i 



V,r^ + S, 



1 w, 



which simphfies to 



det(V-i + Sry/^ 



det(Sri)-i/2 det(Vri)-i/2 

1 , 

exp 

^ajAr(wi|mj,Tij-)> 



(6) 



where we have defined 



v, + s,. 



(7) 
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Restoring the projection matrices Rj, we have 
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where 



T 



Rj Vj- Rj^ + Sj . 



(9) 



The free parameters of this model can now be chosen such as to maximize an exphcit, 
justified, scalar objective function 0, given here by the logarithm (log) likelihood of the 
model given the data, i.e.. 



This function can be optimized in several ways, one of which is to calculate the gradients and 
use a generic optimizer to increase the likelihood until it reaches a maximum. This approach 
is complicated by parameter constraints (e.g., the amplitudes aj must all be non- negative 
and add up to one, the variance tensors must be positive definite and symmetric). In what 
follows we will describe a different approach: An EM algorithm that iteratively maximizes 
the likelihood, while naturally respecting the restrictions on the parameters. 

3. Fitting Mixtures with heterogeneous, noisy data using an EM algorithm 

The problem of finding a maximum likelihood estimate of the parameters of the mixture 
of Gaussians model by optimizing the total log likelihood given in equation (10) is not a 
problem with missing data. However, optimization of the total log likelihood is difficult and 
an analytical solution is not possible for K > 1. An analytical solution does exist when K = 1 
(when the error covariances Sj are equal; see below). Therefore, if we knew which Gaussian 
a specific data point was sampled from, optimization would be simple. The formulation of 
the Gaussian mixture density estimation as a hidden data problem takes advantage of this 
fact (Dempster et al. 1977). 

When data is actually missing and/or when the data is noisy (error covariances Sj not 
equal to zero), an analytical solution does not exist anymore. As we will show below, in this 
case formulating the problem as a missing data problem can also lead to a simple, iterative 
algorithm that leads to a maximum likelihood estimate of the model parameters. First we 
will recapitulate how the EM algorithm for Gaussian mixtures works by applying it to the 
basic problem of fitting a set of data points with a mixture of Gaussians, thus we will set all 



K 



(f) = ^lnp(wi|6') = ^ln^ajA/'(wi|Rimj,Tij) . 



(10) 



i i j=l 
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the error covariances equal to zero. Then we will investigate how the problem can be solved 
by a similar EM algorithm when we have incomplete and/or noisy data. A short summary 
of the general properties of the EM methodology is given in appendix C. 



3.1. The EM algorithm with complete, precise observations 

In the case of complete, precise observations (i.e., S,/ = 0, Rj = I, V i) the log likelihood 
of the model given the data from equation (10) reduces to the following log likelihood: 

K 

<P = J2 lnp(w,|^) = E 1^ E «.A^(w.|m„ V,) . (11) 

i i j=l 

Formulating this problem as a missing data problem introduces the indicator variables qij 
which indicate whether a data point i was sampled from Gaussian j, i.e., 

f 1 if data point i was generated by Gaussian j 
\ if data point i was not generated by Gaussian j 

This variable can take on values between these extreme values, in which corresponds 
to the probability that data point i was generated by Gaussian j. In any case, for every data 
point we have that qij = 1. 

Using this hidden indicator variable we can write the "full-data" log likelihood as 

K 

$ = E E 1^ «iAr(w,|m„ V,) . (13) 

i j=l 

Using Jensen's inequality (e.g., MacKay 2003) it is easy to see that optimizing this full-data 
log likelihood also optimizes the original log likelihood equation (11). Jensen's inequality for 
a concave function /, numbers Xj, and weights qj can be stated as 

(14) 



The logarithm is a concave function and, defining pij = ajA/'(wj|mj, V,), if we choose x 



- 8 - 



Pij/qij and weights qij, the indicator variables defined above, we find (since lij = 1) 

K K 

In J^Pij = InJ^gij^ 

j=l 3=1 
K 

3=1 '^'^ 
K K 

> Y Qij In Pij - 2^ Qij In Qij , (15) 
i=i 3=1 



with equahty when we set 



Q^3 = ^ , (16) 



since then 

K K 



El Pij Pii , f Y.kPik 

,-1 % ptJ2kP^k V P^3 




(17) 

This proves that the EM algorithm applied to the full-data log likelihood in equation (13) 
leads to a maximum likelihood estimate of the model parameters for the original likelihood, 
since the calculation of the E-step, i.e., taking the expectation of the full-data log likelihood, 
essentially reduces to calculating the expectation of the indicator variables Qij given the data 
and the current model estimate, which is exactly setting the qij equal to the posterior prob- 
ability of the data point i belonging to Gaussian j, as given in equation (16). Optimization 
of the F-function from equation (15) with respect to the model parameters is equivalent 
to optimization of the full-data log likelihood, since the second term in the definition of F 
does not depend on the model parameters and the first term is equal to the full-data log 
likelihood. This view of the EM algorithm for a mixture of Gaussians (Hathaway 1986) 
can also be used to argue for different E- and/or M-steps in order to speed up convergence 
(Neal and Hinton 1998). For example, one can choose particular data points to target in the 
E-step or specific model parameters in the M-step. 
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In the M step we maximize ($) with respect to the model parameters 9. The constraint 
on the amphtudes (i.e., that they add up to one) can be implemented by adding a Lagrange 
multiplier A and an extra term A {^jOtj — to ($). Taking the derivative of this with 



respect to a,- then leads to 



^ ' = 

da^ da 



i 

The requirement that the aj add up to one leads to 

^a^. = 1 ^ $^-^$^% = l 

j j i 

* j 

i 

^ \ = -N, (19) 

where we have used the fact that = 1 and is the number of data points. Therefore 

the optimal value of aj is 

i 

The rest of the optimization reduces to optimizing the reduced log likelihood 
(•^red) =J2Y1 [1^^^* + ~ mi)^Vj"'(wi - m^)] , 

i j 

which we can simplify by using (1) that Indet V,- = Trace In V,-, (2) that for any number a 
we can write that a = Trace(a), and (3) the cyclical property of the trace: 

^(*red) = [^f^^^J - VrMVjV-i(w, - mj)(wi - mjY - 2Y-\wi - mj)c?mj] 

(21) 
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where we have used the fact that dV- ^ = — V^- ^(iV,Vj- ^ (which one can derive by differen- 
tiating V,Vj~^ = !)• This is equal to zero when 



rrij 



Ei9u(wi-mj)(wi-m^ 



To summarize, the hkehhood can be optimized by alternating the following E- and 
M-steps: 

E-step: Qij <r 



M step: ^ ^ ^ q,j 

i 







Efc"fe-^(wi| 


|mfc, Vfc) 



nij ^ — > QiiWi 



Qi i 



where = Ei%- 



A fatal flaw of the maximum likelihood technique described here, and we must emphasize 
that this is a flaw of the objective function and not of the EM algorithm, is that the likelihood 
is unbounded. Indeed, when one of the means nij is set equal to one of the data points, 
reducing the covariance matrix V,- of that Gaussian will lead to an unbounded increase in 
the probability of that data point, i.e., the Gaussian becomes a delta distribution centered 
on a particular data point and the model therefore has a infinite likelihood. This problem 
can be dealt with in a couple of different ways, some of which will be described below. 
We will describe a solution which assumes a prior for the model covariances, which leads 
to a regularization of the covariances at every step such that they cannot become zero. 
However, we will use this technique to deal with the related problem of the model covariances 
reaching a maximum likelihood at the edge of their domain, i.e., when the covariance becomes 
zero without making the likelihood infinite. In the next section we will describe a solution 
to the unbounded likelihood problem that is especially well motivated when dealing with 
experimental or observational data, i.e., taking into account the measurement uncertainties. 
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3.2. The EM algorithm with heterogeneous, noisy data 



The problem we would like to solve has an additional component of incompleteness. 
The data we are using are noisy, as described by their covariance matrices Sj. This noise can 
vary between small fluctuations to a complete lack of information for certain components of 
the underlying quantity v (as indicated by a formally infinite contribution to the covariance 
matrix). We can use the EM algorithm to deal with this second kind of incomplete data as 
we\\\ 

In the case of full-rank projection matrices, we could try to proceed exactly as we did 
in the case of complete data with noise covariance matrices Si equal to the zero matrix. The 
E-step remains the same and the part of the M-step that updates the amplitudes will also 
be the same as before, however, in order to optimize the means and covariances, we would 
have to solve an equation similar to equation (21), i.e.. 



^(•^red) = E [Tr.irfT,, - Tr^'dT,,T7^\w, - m,)(w, - m,)^ - 2T,r.i(w, - m,)rfmj] = , 



in which we simply use Tjj = Vj + Sj instead of V,-. This equation cannot be solved 
analytically to give us update steps for the means and covariances of the Gaussians when 
the noise covariances Sj are different for each observation. A reasonable way to deal with this 
would be to use a generic optimizer to perform the M-step optimization, which would still give 
a better result than using a generic optimizer for the whole problem since the dimensionality 
of the problem is greatly reduced, however, we will describe a different procedure which deals 
with this problem in a similar way to how the EM algorithm simplified the complete data 
case. 

Essentially, what we will do is consider the situation of noisy measurements, which may 
or may not be projections of higher dimensional quantities, as a missing data problem in 
itself. That is, we will consider the true values Vj as extra missing data (in addition to the 
indicator variables qij). This allows us to write down the "full data" log likelihood as follows 



We will now show how we can use the EM methodology to find straightforward update steps 
that maximize the full data likelihood of the model. Then we will prove that these updates 
also maximize the likelihood of the model given the noisy observations. 



(24) 




(25) 



^This algorithm was developed independently before (Diebolt & Celeux, 1989, unpublished; Diebolt & 
Celeux, 1990, unpublished). 
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The E-step consists as usual of taking the expectation of the full data likelihood with 
respect to the current model parameters 9. Writing out the full data log likelihood from 
equation (25) we find 



d 



- ln(27r) - - Indet V, - -(v, - m,)-V,r^(v, 



(26) 



which shows that in addition to the expectation of the indicator variables qij for each com- 
ponent we also need the expectation of the qijWi terms and the expectation of the q'ijVjV^ 
terms given the data, the current model estimate and the component j. The expectation of 
the Qij is again equal to the posterior probability that a data point Wj was drawn from the 
component j (see equation 16). The expectation of the Vj and the VjV^ can be found as fol- 
lows: Consider the probability distribution of the vector [v^ w,^]^ given the model estimate 
and the component j. From the description of the problem we can see that this vector is 
distributed normally with mean 



m 



Rjirij 



and covariance matrix 



v 



Vi V^R/ 



(27) 



(28) 



The conditional distribution of the Vj given the data Wj is normal with mean (see appendix 
B) 

bj, = nij + V,RTT-.i ( Wj - Ri nij ) (29) 



and covariance matrix 



(30) 



Thus we see that the expectation of Vj given the data Wj, the model estimate, and the 
component j is given by bjj , whereas the expectation of the v^v^^ given the same is given by 
Bij + hijhjj. 

Given this the expectation of the full data log likelihood is given by 
($) = ^ qij In aj - ^ ln(27r) - ^Trace [in V,- + {B^j + bj^bT. - bi^mj - m^bT. + mj-mj) Vr^] 



\naj ln(27r) Trace [inVj 



fB,. 



(31) 



The update step for the amplitudes aj is given as before by equation (20). Dropping the 
ln(27r) term the differential of the reduced expectation of the full data log likelihood ('^'j-gj^) 
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is given by 

d('^'red) = ^ij %Trace [Vf'dVj - Vr^dVjVf^ [(b,, - m,)(b,, - m,)^ + Bi,] - 2Vr\hij - m,)dmj] 

(32) 

The complete EM algorithm given incomplete, noisy observations is then given by 
E-step: Qij <- 





Rj lUj, Tij) 







M step: a^- ^ — ^ 



nij ^ yjj >_.jj 



-J- ^ g^j- [(mj - bij) (mj - b^^)^ + Bi^] , (33) 



where, as before, = 



We can prove that this procedure for maximizing the full data likelihood also maxi- 
mizes the log likelihood of the data w, given the model. We use Jensen's inequality in the 
continuous case (e.g., MacKay 2003), i.e., 

/ (^J dvg(v)p(v)^ > J dvg(v)/(p(v)), (34) 

for a concave function / and a non-negative integrable function g, where we have assumed 
that g is normalized, i.e., g is a probability distribution. For each observation w we can then 
introduce a function g(v, j) such that 

\iip{w\9) = In^^ / dvp(w, V, j|6') 

lnp(w|#) > f(wk,») = {lnp(w,v,i|#)), + H(«), (35) 

where 6, as before, represents the set of model parameters, and 7i is the entropy of the 
distribution g(v, j). This inequality becomes an equality when we take 

9(v, j) = j|w,6') , (36) 
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since then 



= ^ dvp{Y,j\w,9)\np{w\9) 

= lnp(w|6')^ / dvp(v, j|w,6') 

= lnp(w|^). (37) 

The above holds for each data point, and we can write 

p{vj\wi,e) =p{v\wi,e,j)p{j\wi,e). (38) 

The last factor reduces to calculating the posterior probabilities qij = p{j\wi,9) and we 
can write the F function as (we drop the entropy term here, since it plays no role in the 
optimization, as it does not depend on the model parameters) 

F = ^qij dYp{-v\wi,9,j)\np{wi,\,j\9) 

id ^ 

= ^^ij dvp(v|wi,6', j) lnp(j|6')p(wi, v|6',j) 

id ^ 

= ^lii dvp(v|wi,6', j) lnp(j|6')p(wi|v)p(v|6',j) 

id ^ 

= X^^ii / dvp(v|wi,6',j) [Inttj + lnp(wi|v) + lnj9(v|6', j)] (39) 



^qij lnaj+ / dvp(v|wi, 6^, j) 
id ^ 



ilndetV,-i(v-m,)-V,ri(v 



where we dropped the lnp(wj|v) terms since they don't depend on the model parameters 
directly. This shows that this reduces exactly to the procedure described above, i.e., to 
taking the expectation of the Vj and VjV^ terms with respect to the distribution of the Vj 
given the data Wj, the current parameter estimate, and the component j. We conclude that 
the E-step as described above ensures that the expectation of the full data log likelihood 
becomes equal to the log likelihood of the model given the observed data. Optimizing this 
log likelihood in the M-step then also increases the log likelihood of the model given the 
observations. Therefore the EM algorithm we described will increase the likelihood of the 
model in every iteration, and the algorithm will approach local maxima of the likelihood. 
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Convergence is identified, as usual, as extremely small incremental improvement in the log 
likelihood per iteration. 

Some care must be taken to implement these equations in a numerically stable way. In 
particular, care should be taken to avoid underflow when computing the ratio of a small 
probability over the sum of other small probabilities (see appendix A). Notice that we don't 
have to explicitly enforce constraints on parameters, e.g., keeping covariances symmetric and 
positive definite, since this is taken care of by the updates. For example, the update equation 
for Vj is guaranteed by its form to produce a symmetric positive-semidefinite matrix. 

In some cases we might want to keep some of the model parameters fixed during the EM 
steps, and for the means and the covariances this can simply be achieved by not updating 
them during the M step. However, if we want to keep certain of the amplitudes fixed we 
have to be more careful, as we have to satisfy the constraint that the amplitudes add up to 
one at all times. Therefore, if j' indexes the free amplitudes and j" the set of amplitudes 
we want to keep fixed, the Lagrange multiplier term we have to add to the functional F is 
A '^j' ~ 1 + '^j" ) ) which leads to the following update equations for the amplitudes 



Singularities and local maxima are two problems, that can severely limit the general- 
ization capabilities of the computed density estimates for inferring the densities of unknown 
data points. These are commonly encountered when using the EM algorithm to iteratively 
compute the maximum likelihood estimates of Gaussian mixtures. Singularities arise when 
the covariance in equation (23) or equation (33) becomes singular, i.e., when a Gaussian in 
the fit becomes very peaked or elongated. A naive way of dealing with this problem is by 
artificially enforcing a lower bound on the variances of the Gaussians in the mixture, but 
this usually results in very low inference capabilities of the resulting mixture estimate. We 
will present a regularization scheme based on defining a Bayesian prior distribution on the 
parameter space below. 

The problem of local maxima is a natural consequence of the EM algorithm as presented 
above: as the EM procedure ensures a monotonic increase in log likelihood of the model, the 
algorithm will exit when reaching a local maximum, since it cannot reach the true maximum 
without passing through lower likelihood regions. A solution to this problem therefore has to 
discontinuously change the model parameters, thereby quite possibly ending up in a region of 




(40) 



4. Extensions to the basic algorithm 
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parameter space with a lower hkehhood than the first estimate. As a result the log likelihood 
is no longer guaranteed to increase monotonically, however, decreases in log likelihood are 
rare and concentrated at these discontinuous jumps. A well-motivated algorithm based 
on moving Gaussians from overpopulated regions to underpopulated regions of parameter- 
space is described below. However, instead of this deterministic approach, a stochastic EM 
procedure could also be used to explore the likelihood surface (Broniatowski et al. 1983; 
Celeux and Diebolt 1985, 1986). 

We also briefly discuss methods to set the remaining free parameters, the number of 
Gaussians K and the hyperparameters introduced in the Bayesian regularization described 
below. 



4.1. Bayesian regularization 

A general Bayesian regularization scheme consists of putting prior distributions on the 
Gaussian mixtures parameters space 6 = {aj, m^, Vj) (Ormoneit and Tresp 1996). A conju- 
gate prior distribution of a single multivariate normal distribution is the product of normal 
density Af{mj\m.,r]~^Yj) and a Wishart density >V(Vj~^|u;, W) (Gelman et al. 2000), where 

W(V-V, W) = c(u;, W)|V-^|'^"('^+^)/2 [_Trace [WV-^]] , (41) 

with c(cij, W) a normalization constant. The proper prior distribution on the amplitudes aj 
is a Dirichlet density V{a\'y), given by 

D(«i7)=&n«?"'' (42) 

j 

where 6 is a normalizing factor, aj > 0, and aj = 1 (as is the case for the amplitudes in 
the density mixture). The log likelihood we want to maximize then becomes 

^\np{wi\e) ^ ^lnp(wil^) + lnP(a|7) + 5^ [lnAr(mj-|m,r/^^Vy) +ln>V(VrV,W)] . 

i i j 

(43) 

We can again use the EM algorithm to find a local maximum to this function. The 
E-step is the same E-step as before (eq. (33)). In the M step the functional we have to 
maximize is the same functional as in the unregularized case (given in eq. (31)) with added 
terms: 

($) ^ ($) +7^(gi) + lnP(a|7) + ^ [lnAA(mj|m,r/~^V^-) +ln>V(VrV,W)] . (44) 

j 
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of which we can take derivatives with respect to the model parameters again, which leads to 
the following update equations 



E» lij + 7j - 1 



' qj + l + 2{uj-{d + l)/2) ■ ^ ' 

These update equations have many free hyperparameters without well-motivated values. 
All of these could be used in principle to regularize the model parameters. For example, the 
7j could be used to keep the amplitudes aj from becoming very small. When one does not 
want to set these hyperparameters by hand, their optimal values can all be obtained by the 
model selection techniques described below. However, in what follows we will focus on the 
regularization of the covariances. It is easy to see from the update equation for the variances 
Vj that the updated variances are bound from below, since the numerator is greater or equal 
to 2W and the denominator has an upper limit. Therefore we can focus on the matrix W 
for the purpose of the regularization by setting the other parameters to the uninformative 
values: 

7, = 1 Vj, uj = {d+l)/2, 7] = 0. (46) 

We can further reduce the complexity to one free parameter by setting W = w/2I, where I 
is the (i-dimensional unit matrix. The update equations in the M step then reduce to 



M step: aj ^ ^ ^ qij 



N 



1 



^ Qij [{uij - hij) {nij - hijY + B,j] + wl 



(47) 



The best value to use for the parameter w is not known a priori but can be determined, 
e.g., by jackknife leave-one-out cross validation. 



4.2. Split and merge algorithm for avoiding local mcixima 



The split and merge algorithm starts from the basic EM algorithm, with or without 
the Bayesian regularization of the variances, and jumps into action after the EM algorithm 
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has reached a maximum, which more often than not will only be a local maximum. At this 
point three of the Gaussians in the mixture are singled out, based on criteria detailed below, 
and two of these Gaussians are merged while the third Gaussian is split into two Gaussians 
(Ueda et al. 1998). Let us denote the indices of the three selected Gaussians as ji,j2, and 
js, where the former two are to be merged while js will be split. 

The Gaussians corresponding to the indices ji and j2 will be merged as follows: the 
model parameters of the merged Gaussian j[ are 

Qn+Qn 

where 6j stands for mj and V^. Thus, the mean and the variance of the new Gaussian is a 
weighted average of the means and variances of the two merging Gaussians. 

The Gaussian corresponding to js is split as follows: 

"i2 = "is ="i3/2 

V,, = V,. =det(V,3)i/'^I. (49) 

Thus, the Gaussian is split into equally contributing Gaussians with each new Gaussian 
having a covariance matrix that has the same volume as V^g. The means rrij^ and rrij^ can 
be initialized by adding a random perturbation vector ej^ to rrijg, e.g., 

^jL = + , (50) 
where ||ej„Jp <^ det(Vj^y^''- and m = 1,2. 

After this split and merge initialization the parameters of the three affected Gaussians 
need to be re-optimized in a model in which the parameters of the unaffected Gaussians 
are held fixed. This can be done by using the M step in equation (47) for the parameters 
of the three affected Gaussians, while keeping the parameters of the other Gaussians fixed, 
including the amplitudes, i.e., we need to use the update equation (40) for the amplitudes. 
This ensures that the sum of the amplitudes of the three affected Gaussians remains fixed. 
This procedure is called the partial EM procedure. After convergence this is then followed by 
the full EM algorithm on the resulting model parameters. Finally the resulting parameters 
are accepted if the total log likelihood of this model is greater than the log likelihood before 
the split and merge step. If the likelihood doesn't increase the same split and merge procedure 
is performed on the next triplet of split and merge candidates. 

The question that remains to be answered is how to choose the 2 Gaussians that should 
be merged and the Gaussian that should be split. In general there are K{K — 1){K — 2)/2 
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possible triplets like this which quickly reaches a large number when the number of Gaussians 
K gets larger. In order to rank these triplets one can define a merge criterion and a split 
criterion. 

The merge criterion is constructed based on the observation that if many data points 
have equal posterior probabilities for two Gaussians, these Gaussians are good candidates to 
be merged. Therefore one can define the merge criterion: 

JmergeUA;|^) = P,(e)^P,(^), (51) 

where Pj(6') = {qu-, ■ ■ ■ iQiNY is the A^-dimensional vector of posterior probabilities for the 
jth Gaussian. Pairs of Gaussians with larger Jmerge are good candidates for a merger. 

We can define a split criterion based on the KuUback-Leibler distance between the local 
data density around the /th Gaussian, which can be written in the case of complete data as 
Piiyf) = l/qiJ2i1ii^i^ ~ Gaussian density specified by the current model 

estimates m; and V;. The Kullback-Leibler divergence between two distributions p{x) and 
q{x) is given by (MacKay 2003): 

Dkl{P\\Q)= J dxp(x)ln^. (52) 

Since the local data density is only non-zero at a finite number of values, we can write this 

as 

' " ^ ^"'^ ^ (53) 



'^split(^l^) = ^Zl^^' ^''(^) -l^-^^^^l™''"^' 



The larger the distance between the local density and the Gaussian representing it, the larger 
"^plit better candidate this Gaussian is to be split. 

When dealing with incomplete data determining the local data density is more problem- 
atic. One possible way to estimate how well a particular Gaussian describes the local data 
density is to calculate the Kullback-Leibler divergence between the model Gaussian under 
consideration and each individual data point perpendicular to the unobserved directions for 
that data point. Thus, we can write 



'^split(^l^) = ^5^*?'' ) -lnAr(w,|R,m;,R,V,R7) 



(54) 



Candidates for merging and splitting are then ranked as follows: first the merge cri- 
terion Jmerge (j7 ^ I ^) is calculated for all pairs j,k and the pairs are ranked by decreasing 
Jnierge(j; ^1^)- For each pair in this ranking the remaining Gaussians are then ranked by 
decreasing Jspiit(^l^)- 
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4.3. Setting the remaining free parameters 

No real world application of Gaussian mixture density estimation is complete without 
a well-specified methodology for setting the number of Gaussian components K and any 
hyperparameters introduced in the Bayesian regularization described above, the covariance 
regularization parameter w in our basic scheme. This covariance regularization parameter 
basically sets the square of the smallest scale of the distribution function on which we can 
reliable infer small scale features. Therefore, this scale could be set by hand to the smallest 
scale we believe we have access to based on the properties of the data set. 

In order to get the best results the parameters K and w should be set by some objective 
procedure. As mentioned above, leave-one-out cross validation (Stone 1974) could be used 
to set the regularization parameter and the number of Gaussians could be set by this 
procedure as well. The basic idea of leave-one-out cross validation is that one creates 
new data sets by sequentially taking one data point out of the original sample. For each 
of these new samples we optimize the scalar objective function and then record the 
logarithm of the likelihood of the data point that was left out under this new optimized 
parameter set. Summing up all of these cross validation log likelihoods then gives a scalar 
which can be compared for different models and the best model is the one for which the 
total cross-validation log likelihood is the largest. This procedure, while simple, can be 
quite computationally intensive and is therefore often not feasible. The procedure can be 
simplified by (1) leaving out more than one data point at a time, i.e., creating A^' new samples 
by leaving out 1 / A^'*^ of the full sample, optimizing the scalar objective using the new sample 
and recording the total cross validation likelihood of the data points left out at that step; 
(2) restricting the optimization, e.g., by starting from the optimized parameters for the full 
sample and only allowing certain parameters to vary in the individual optimization of the 
A^' new samples. For example, one can choose to only allow the amplitudes of the optimized 
parameters found for the full sample to change during the cross validation optmizations. 

Other techniques include methods based on Bayesian model selection (Roberts et al. 
1998) as well as approaches based on minimum coding inference (Wallace and Boulton 1968; 
Oliver et al. 1996; Rissanen 1978; Schwartz 1978), although these methods have difficulty 
dealing with significant overlap between components (such as the overlap we see in the 
example in Figure 2), but there are methods to deal with these situations (Baxter 1995). 
Alternatively, when a separate, external data set is available, we can use this as a test data 
set to validate the obtained distribution function. All of these methods are explored in an 
accompanying paper on the velocity distribution of stars in the Solar neighborhood from 
measurements from the Hipparcos satellite (see below; Bovy et al. 2009). 
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5. Overview of the full algorithm 

To summarize the full algorithm we briefly list all the steps involved: 

1. Run the EM algorithm as specifled in equations (33) and (47). Store the resulting 
model parameters 6* and the corresponding model log likelihood 0*. 

2. Compute the merge criterion JmeigeU^ k\6*) for all pairs j,k and the split criterion 
Jgpjj|-(/|6'*) for all I. Sort the split and merge candidates based on these criteria as 
detailed above. 

3. For the flrst triplet (j, k, I) in this sorted list set the initial parameters of the merged 
Gaussian using equation (48) and the parameters of the two Gaussian resulting from 
splitting the third Gaussian using equations (49)- (50). Then run the partial EM pro- 
cedure on the parameters of the three affected Gaussians, i.e., run EM while keeping 
the parameters of the unaffected Gaussians flxed, and follow this up by running the 
full EM procedure on all the Gaussians. If after convergence the new log likelihood (p 
is greater than 0*, accept the new parameter values 6* ^ 6 and return to step two. If 
(f) < (j)* return to the beginning of this step and use the next triplet (j, /c, /) in the list. 

4. Halt this procedure when none of the split and merge candidates improve the log 
likelihood or, if this list is too long, if none of the first C lead to an improvement. 

Deciding when to stop going down the split-and-merge hierarchy will be dictated in any 
individual application of this technique by computational constraints. This is an essential 
feature of any search-based approach to finding global maxima of (likelihood) functions. 

6. Applications to real data 

6.1. General considerations 

Inferring the distribution function of an observable given only a finite, noisy set 
of measurements of that distribution function and the related problem of finding clus- 
ters and/or overdensities in the distribution is a problem of significant general interest 
in many areas of science and of astronomy in particular (e.g., McLachlan and Basford 
1988; Rabiner and Biing-Hwang 1993; Dehnen 1998; Helmi et al. 1999; Skuljan et al. 1999; 
Hogg et al. 2005). The description you are interested in as a scientist is not the observed 
distribution, what you really want is the description of the distribution that you would have 
if you had good data, i.e., data with vanishingly small uncertainties and with all of the 
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dimensions measured. In the low signal-to-noise regime, e.g., large data sets at the bleeding 
edge in astronomy, the data never have these two properties such that the underlying, true 
distribution cannot be found without taking the noise properties of the data into account. If 
you want know the underlying distribution, in order to compare your model with the data, 
you need to convolve the model with the data uncertainties, not deconvolve the data. When 
the given set of data has heterogeneous noise properties, that is, when the error convolution 
is different for each data point, each data point is a sample of a different distribution, 
i.e., the distribution obtained from convolving the true, underlying distribution with the 
noise of that particular observation. Incomplete data poses a similar problem when the 
part of the data that is missing is different for different data points. None of the current 
density estimation techniques confront all of these issues (e.g., McLachlan and Basford 1988; 
Silverman 1986), although techniques that properly account for incomplete data have been 
developed (Ghahramani and Jordan 1994a,b). 

A first approximation to the problem of modeling the distribution of a set of real- 
valued data points that is often used is to fit a single (multivariate) normal distribution 
to the distribution of the data points, e.g., when applying principal component analysis. 
In general this is hardly ever a good approximation to the full distribution, e.g., when 
a distribution shows two distinct maxima or significant asymmetry this approximation is 
poorly suited to the problem at hand. However, fitting a distribution that is the sum 
of a large enough number of normal distributions provides a good approximation to any 
reasonably well-behaved underlying distribution. For example, a multi-modal distribution 
will be described by a sum of normal distributions with varying means, while a single-peak 
distribution with non-Gaussian tails could be fit by a set of Gaussians with similar means 
but varying amplitudes and covariance matrices. Therefore, the technique described in the 
previous sections is perfectly suited for applications with real data sets. 



6.2. A simple application: fitting a line to data with non-trivial errors 

One of the simplest applications of the algorithm described above consists of fitting a 
linear relationship to a set of data points with, possibly correlated, uncertainties in both 
the independent and the dependent variable. Many different methods have been devised 
to fit a straight line to a data set, most of them least-squares procedures (e.g., Isobe et al. 
1990; Feigelson and Babu 1992). When measurement errors are present in both variables a 
"double weighted" regression can be performed which minimizes errors both in the dependent 
variables as well as in the independent variable (York 1966; Mclntyre et al. 1966). This 
procedure can be generalized to the case of correlated errors (York 1969). In the case 



- 23 - 



of uncorrelated errors, one finds the parameters of the straight line Y = AX + B which 
minimize the quantity 

S = Y1 i^^^ ~ ^'^1^1 + ~ > (55) 

i 

where the {xi,yi) are the observed {xi,yi) adjusted along a line of slope cr^/a^ to lie on the 
straight line. This in fact assumes that the errors are highly correlated, even though by 
assumption they are not, and does not allow for the full exploration of each pairs' covariance 
ellipse. In addition to this, it is difficult to implement numerically. 

As mentioned in the introduction, the density estimation technique developed in this 
paper can be applied to this problem as well. Fitting a single Gaussian distribution to the 
data will pick out the direction of the largest variance, which one can identify with the 
direction of the linear relation between two variables. Since the technique described above 
takes care of the heterogeneous error properties of the data, this technique can be applied 
straightforwardly to this problem. This is the simplest application of the density estimation 
technique as it only concerns a single Gaussian component in the mixture, and none of the 
extensions to the algorithm have to be applied to this problem. 

In detail, one fits a single Gaussian distribution to the observed data, with no restrictions 
on the mean or variance of that Gaussian. After convergence, one identifies the direction 
corresponding to the largest eigenvalue as the direction of the linear relation. The line in 
this direction going through the mean of the best-fit Gaussian is then the desired linear fit 
to the data. 

This technique is similar to the procedures for fitting a line to data described at the 
beginning of the section in that it minimizes a quantity similar to the quantity S in equa- 
tion (55), but it uses the full uncertainty covariance matrices for each datapoint and has a 
simple numerical implementation given by the EM procedure described in this paper. The 
objective function given in equation (10) is, as shown in section 2, justifiable under the as- 
sumption of Gaussian distributed errors and a underlying Gaussian distribution. The fact 
that the model space is more general than the the model of a straight line that we want to 
fit to the data — the straight line model is in fact part of the model space as an infinitely 
long, infinitely narrow Gaussian distribution — can hardly be thought of as a disadvantage, 
as it can show that the assumption of a linear relation is suspect through the aspect ratio 
of the best fit Gaussian. 

This procedure was used in Hogg et al. (2005) to fit a line to a set of points with 
correlated errors in the abscissa and ordinate (Figure 2 in that paper). Here we fit the TuUy- 
Fisher relation in several bands from Hubble Space Telescope {HST) data as an example of 
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this procedure. The TuUy-Fisher relation is a power-law type relation between the luminosity 
and the velocity width of spiral galaxies. This relation can be used to measure distances in 
the Universe because it relates a velocity, which can be measured relatively easily by Doppler 
measurements, to the intrinsic luminosity, which is hard to observe, but depends on the easily 
observed apparent brightness of the galaxy by the distance squared (TuUy and Fisher 1977). 
The relation between luminosity L and velocity width W is of the form 

L(xW^ , (56) 

where the exponent 7 depends on the wavelength at which the luminosity is measured but 
generally lies somewhere between three and four. 

Calibrating the TuUy-Fisher relation by measuring accurate Cepheid distances to nearby 
galaxies was one of the key projects of the Hubble Space Telescope and as an illustration we 
fit the TuUy-Fisher relation in five different bandpasses here from the HST data (Table 2 of 
Sakai et al. 2000). Uncertainties in both the velocity widths (measured at 20 percent of the 
peak HI flux) of all the galaxies as well as in the absolute magnitudes are large such that 
our procedure naturally applies (the fits in Sakai et al. 2000 are bivariate fits minimizing 
errors in both variables). The resulting linear relation is shown in Figure 1 for the different 
bandpasses. Using a leave-one-out jackknife procedure to establish the uncertainties on the 



slopes and intercepts, we find the following relations 

= -(8.04 ± 0.63) (logio ^^20 - 2-5) - (19.77 ± 0.11) (57) 

= -(8.86 ± 0.57) (logio W^q - 2.5) - (20.33 ± 0.08) (58) 

i?^ = -(8.79 ± 0.50) (logio - 2.5) - (20.64 ± 0.07) (59) 

= -(9.26 ± 0.57) (logio W^2o - 2-5) - (21.12 ± 0.07) (60) 

^ = -(11.03 ± 0.75) (logio W^Q - 2.5) - (21.73 ± 0.07) . (61) 



These relations agree within the uncertainties with the relations found in Sakai et al. (2000). 

6.3. The velocity distribution from Hipparcos data 

We have applied the technique developed in this paper to the problem of inferring the 
velocity distribution of stars in the Solar neighborhood from transverse angular data from the 
Hipparcos satellite and we present in this section some results of this study to demonstrate 
the performance of the algorithm on a real data set. A more detailed and complete account 
of this study is presented elsewhere (Bovy et al. 2009). 

The astrometric ESA space mission Hipparcos, which collected data over a 3.2 year 
period around 1990, provided for the first time an all-sky catalogue of absolute parallaxes 



- 25 - 



and proper motions, with typical uncertainties in these quantities on the order of mas (ESA 
1997). From this catalogue of ~ 100, 000 stars kinematically unbiased samples of stars with 
accurate positions and velocities can be extracted (Dehnen and Binney 1998). Since astro- 
metric measurements of velocities basically just compare the position of a star at different 
times to calculate velocities, the only components of a star's velocity that can be measured 
astrometrically are the tangential components. The line-of-sight velocities of the stars in 
the Hipparcos sample were therefore not obtained during the Hipparcos mission. Since the 
proper motions that are measured by differencing sky positions are angular velocities, the 
distance to a star is necessary in order to convert the proper motions to space velocities. 

Distances in astronomy are notoriously hard to measure precisely, and at the accuracy 
level of the Hipparcos mission distances can only be reliably obtained for stars near the Sun 
(out to ~ 100 pc). In addition to this, since distances are measured as inverse distances 
(parallaxes) only distances that are measured relatively precise will have approximately 
Gaussian uncertainties associated with them. Balancing the size of the sample with the 
accuracy of the distance measurement leaves us with distance uncertainties that are typically 
~ 10-percent, such that the tangential velocities that are obtained from the proper motions 
and the distances have low signal-to-noise. 

Of course, if we want to describe the distribution of the velocities of the stars in this 
sample, we need to express the velocities in a common reference frame, which for kinematical 
studies of stars around the Sun is generally chosen to be the Galactic coordinate system, 
in which the x-axis points towards the Galactic center, the y-axis points in the direction 
of Galactic rotation, and the 2;-axis points towards the North Galactic Pole (Blaauw et al. 
1960; Binney and Merrifield 1998). The measured tangential velocities are then projections 
of the three-dimensional velocity of a star with respect to the Sun in the two-dimensional 
plane perpendicular to the line-of-sight to the star. Therefore, this projection is different for 
each individual star. 

The discussion in the previous paragraphs shows that this sample of stars exhibits all 
of the properties (incomplete data and noisy measurements, different from observation to 
observation) for which a procedure such as the one developed in this paper is necessary. 

We have studied the velocity distribution of a sample of main sequence stars selected to 
have accurate distance measurements (parallax errors aT^/ir < 0.1) and to be kinematically 
unbiased (in that the sample of stars faithfully represents the kinematics of similar stars). 
In detail, we use the sample of ~ 10,000 stars from Dehnen and Binney (1998), but we 
use the new reduction of the Hipparcos raw data, which has improved the accuracy of the 
astrometric quantities (van Leeuwen 2007a,b). A particular reconstruction of the underlying 
velocity distribution of the stars is shown in Figure 2, in which 10 Gaussians are used and 
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the regularization parameter w is set to 4 km^ s~^. We choose this value for w since we 
beheve that the smallest scale of the features we can see is a few km s~^, because of the scale 
of the differences in rotational velocities around the Galactic center of the stars in the ~100 
pc around the Sun (which is rotating around the Galactic center at ~ 200 km s~^ at ~ 8 
kpc). What is shown are two-dimensional projections of the three-dimensional distribution. 

The recovered distribution compares favorably with other reconstructions of the velocity 
distribution of stars in the Solar neighborhood, based on the same sample of stars (using a 
maximum penalized likelihood density estimation technique, Dehnen 1998), as well as with 
those based on other samples of stars for which three-dimensional velocities are available 
(Skuljan et al. 1999; Nordstrom et al. 2004; Famaey et al. 2005; Antoja et al. 2008). All of 
the features in the recovered distribution function are real and correspond to known features; 
this includes the feature at Vy ~ —100 km s~^, which is known as the Arcturus moving group. 
Therefore, we conclude that the method developed in this paper performs very well on this 
complicated data set. 

The convergence of the algorithm is shown in Figure 3. Only split-and-merge steps that 
improved the likelihood are shown in this plot, therefore, the actual number of iterations is 
much higher than the number given on the x-axis. It is clear that all of the split-and-merge 
steps only slightly improve the initial estimate from the first EM procedure, but since what 
is shown is the likelihood per data point, the improvement of the total likelihood is more 
significant. 

7. Implementation and code availability 

The algorithm presented in this paper was implemented in the C programming language, 
depending only on the standard C library and the GNU Scientific Library^. The code 
is available at http://code.google.eom/p/extreme-deconvolution/; Instructions for its 
installation and use are given there. The code can be compiled into a shared object library, 
which can then be incorporated into other projects or accessed through an IDL^ wrapper 
function supplied with the C code. 

The code can do everything described above. The convergence of the algorithm is slow, 
which is mostly due to the large number of split-and-merge steps that can be taken by 
the algorithm (the split-and-merge aspect of the algorithm, however, can easily be turned 



^http : //www . gnu . org/sof tware/gsl/ 

■'littp : //www. ittvis . com/ProductServices/IDL. aspx 



off or restricted by setting the parameter specifying the number of steps to go down the 
split-and-merge hierarchy) . 
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The calculation of the posterior likelihoods qij in the E step (eq. [33]) can be tricky since 
it can be the ratio of a small probability over the sum of small probabilities, which can lead 
to underflow and a significantly different result for the g^jS. For instance, imagine a situation 
in which only one of the terms is slightly larger than the smallest positive number allowed 
by the compiler and all the other terms are slightly smaller than this number. The result 
of this would be that the qij corresponding to the Gaussian with the largest probability will 
be set to 1 and all the other qijS will be set to zero, while in reality the distribution is much 
more uniform than this. 

We will deal with this problem in two steps: (a) we will calculate not the numerator of 
qij but its logarithm, which will allow us to keep very small probabilities (since in general the 
smallest negative number is much smaller than the logarithm of the smallest positive number 
set by the compiler); (b) we will design a function that given the logarithms of all the g^jS 
returns the logarithm of their sum. Then we can normalize the numerators by subtracting 
the logarithm of their sum from their logarithms. This function performs a kind of 'logsum'. 
Cleverly designing this function will allow us to obtain the sum of a set of numbers, each of 
which is smaller than the smallest positive number. 

We will denote the numerators of the qij by q[j in what follows. The simplest way of 
summing a set of numbers given as logarithms would be to exponentiate each number and 
add it to the sum. However, exp(lng^^) might be below the smallest positive floating point 
threshold set by the compiler for each term. To avoid this underflow problem, we want to 
add a constant c to each of the logarithms such that when exponentiated they are all above 
the threshold, after which we can sum them and take the logarithm of the sum. Finally we 
can subtract this constant c again from the final answer. In short, we use the identity 



A. Logsum 




(Al) 
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To find what this constant c should be we will define DBLJVIIN to be the smallest 
positive floating point number available, and DBL_MAX to be the largest floating point 
number. Avoiding underflow is then achieved when each term is larger than DBL_MIN, or, 
when 

> DBL_MIN , (A2) 



exp 



mm{\nq'--) + c 
j 



which means 



c > ln(DBL_MIN) - min(lng,'.) . 



(A3) 



However, this can be quite large and will therefore sometimes lead to overflow for the 
larger terms in the sum, which is arguably a greater problem than the original underflow. 
Overflow will be avoided when the sum is smaller than DBL_MAX, which we can ensure by 
demanding 



exp (In + c) < K exp 



max(g - ■) + c 

j 



< DBL_MAX . 



where K is the number of terms in the sum. This means that we should have 



(A4) 



c < ln(DBL_MAX) - max(lng^,.) -\nK. 

3 



(A5) 



The second bound obtained here is the most stringent because overflow is worse than 
underflow. This is simply because the underflow will only ignore a very small term, whereas 
overflow will lead to an inflnite answer, which will affect any following calculation. Moreover, 
as we are adding probabilities, i.e., positive numbers, overflow of the kind described in the 
previous paragraph implies that the underflow is probably irrelevant (and if it is not, the 
situation is completely hopeless). Therefore, we should choose c as 

c = min(ln(DBLJvdIN) - min(lng^), ln(DBLJv4AX) - max(lng^.) - In A^) . (A6) 



B. Marginalization and conditioning of multivariate Gaussian distributions 

Suppose we are given a Gaussian distribution for a vector v with mean m and covariance 
matrix V which can be partitioned into a di-dimensional and a (i2-dimensional component 
{di + d2 = d) (vi, V2)^, (mi, m2)^, and 



Vn Vi2 
V21 V22 
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respectively, where — ^21- We can now ask what the distribution of V2 marginahzing 
over Vi is, and what the distribution of vi given V2 is? To answer this question we will 
rewrite the joint distribution function in such a way as to make clear the following identity 



p(v) =p(vi,V2) =p(Vi|v2)p(v2) , 

where p(v) = A/'(v|m, V). 

First we note that we can block-diagonalize the covariance matrix as follows: 



(Bl) 



I 



di 





-V12V2-21 

I<i2 



Vn V12 
V21 V22 



Id, 

-V2-2'V21 I,, 



V 



11 



Vi2V22'V2l 
V22 



(B2) 

Using this we can write the argument of the exponential in the Gaussian distribution as 
follows 
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which becomes using the identity in equation (B2) 
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which can be simplified to give the following 

Vn- 



vi - mi - Vi2V22^(v2 - ma) 
V2 - m2 



Vl2V2-2'V2i 







V 



22 



Vl - mi - Vi2V22^(v2 - m2) 
V2 - m2 

(B3) 

Introducing mi|2 = mi + Vi2V22^(v2 — m2) and Vi|2 = Vn — Vi2V22^V2i this becomes 



Vl - mi|2 
V2 - m2 



Vi|2 

V22 



Vl - mi|2 
V2 - m2 



-i(vi - mi|2)^V^|^(vi - mi|2) - ^(v2 - m2)^V22^(v2 - m2) . 
Using the identity in equation (B2) it can also be shown that 

det V = det Vi|2det V22 , 



(B4) 



(B5) 



since the determinants of the left and right multiplying matrices on the left hand side of 
equation (B2) are equal to one (for block triangular matrices the determinant is equal to the 
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product of the determinants of the diagonal blocks, these are both unit matrices). Therefore 
we can write 

A/'(v|m,V) = {27i)-'^/^det{V)-^/^exp(^-^{Y-myV-\Y~m)^ 
= (27r)-("'^+^^)/2 det(Vi|2)-^/' det(V22)-'/' 

X exp (^-^(vi - mi|2)^V^2(vi - mi|2) - ^(v2 - m2)^V22^(v2 - m2)^ 
= 7V(vi|mi|2,Vi|2)7V(v2|m2, V22) . (B6) 

We can now show that this factorization corresponds to the factorization given in equation 
(Bl) by marginalizing over Vi, i.e., 

p(v2) = J dviA/'(v|m, V) 

= J dvi A/'(vi|mi|2, Vi|2)A/'(v2|m2, V22) 
= Ar(v2|m2,V22) . (B7) 
Therefore, we can identify A/'(vi|mi|2, V112) with p(vi|v2). 
To summarize we can write 

p(vi|v2) = A/'(vi|mi + Vi2V^2^(v2 - 1112), Vii - Vi2V^2^V2i) (conditioning) (B8) 
p(v2) = A/'(v2|m2, V22) (marginalization) . (B9) 



C. The Expectation-Maximization algorithm and extensions 

The EM algorithm is a very general algorithm designed to deal with missing data in max- 
imum likelihood estimates and was first written down in full generality by Dempster et al. 
(1977), although versions of it had already been around longer (e.g., Wolfe 1970; Day 1969). 
The name Expectation-Maximization captures its main ingredients: the EM algorithm seeks 
to iteratively maximize the full-data likelihood (i.e., given data + missing data) but since 
some of the data is missing, it takes the expectation of this full-data likelihood given the 
data and the previous model estimate; the maximization step then maximizes this expec- 
tation of the likelihood with respect to the model parameters. The EM algorithm properly 
should not be called an "algorithm" since it does not specify the actual sequence of steps 
actually required to carry out a single E- or M-step, however, this generality has attributed 
to much of its success and it is now applied to a large variety of problems and numer- 
ous extensions and different interpretations of the original algorithm have been proposed 
(McLachlan and Krishnan 1997). 
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The original EM paper established its general properties and applied the algorithm to 
some specific examples, among which the case of finite mixtures. Among the EM algorithm's 
properties are that it converges monotonically to a local maximum of the likelihood func- 
tion (Dempster et al. 1977; Wu 1983), a property that is shared by the so-called generalized 
EM algorithms (GEM) in which the M-step merely increases the likelihood instead of max- 
imizing it. The convergence properties of the EM algorithm are generally very good, i.e., 
given an initial estimate far away from a local maximum it will converge rather quickly 
to a region of parameter space close to a local maximum and eventually reach the local 
maximum, as opposed to simple, generic optimizers such as the Newton-Raphson method 
which need a good initial estimate to find a local maximum and which does not converge 
monotonically. However, the convergence of the EM algorithm can be very slow, i.e., it 
is approximately linear, often many orders of magnitude slower than quadratic methods. 
Numerous methods have been proposed to accelerate the convergence, e.g., using the Ja- 
cobian to find improved estimates of the model parameters (Louis 1982); employing a gen- 
eralized conjugate gradient approach (Jamshidian and Jennrich 1993); considering a hybrid 
approach that switches to a quadratically converging optimization method close to the lo- 
cal maximum (e.g., Redner and Walker 1984; Jones and McLachlan 1992; Atkinson 1992; 
Aitkin and Aitkin 1996) (these hybrid approaches make use of the fact that most of the 
overall change in likelihood is achieved during the first few iterations (Redner and Walker 
1984)); or by extending the EM algorithm through the use of conditional maximization in 
the M-step (Meng and Rubin 1993; Liu and Rubin 1994). These methods are generally less 
stable and much more complicated than the standard EM algorithm (Lange 1995). EM is 
often easy to implement, especially when the E- and M- step are given by closed expressions, 
but a significant drawback is that it does not give an estimate of the covariance matrix 
of the MLEs. However, the EM algorithm can be extended to give such estimates, either 
by considering the information matrix (Louis 1982; Meng and Rubin 1989, 1991) or by a 
bootstrap approach (Efron 1979; Efron and Tibshirai 1993). 

In a nutshell we can describe the EM algorithm as follows: Suppose we have a likelihood 
function $ for a set of parameters 6', dependent on the data x and the missing data z. The 
likelihood $ can then be maximized iteratively in two steps (Dempster et al. 1977): In the 
k + l—th E-step we calculate the expectation of the likelihood given the data and the current 
parameter estimate, that is, the function 



This function is then maximized in the M-step, leading to the new parameter estimate 



Q{9\9k) = {<^\^,9,) . 



(CI) 



(C2) 
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For many problems that on the surface are not missing data problems it can nevertheless be 
useful to phrase them in such a way that they can be handled by the EM-algorithm, since 
often the maximization in the M-step is much simpler than the original maximization. For 
example, below we will show that by formulating the Gaussian mixture density estimation 
problem as a missing data problem, a completely analytical solution for the E- and M- 
steps can be found. Even when an analytical solution does not exist for the M-step, that 
maximization can (1) be lower dimensional than the original maximization; (2) be handled 
by a GEM algorithm (Dempster et al. 1977) in which a single step of a regular optimizer can 
be sufficient for overall convergence if it increases the likelihood; or (3) be reduced to several 
conditional M-steps for which analytical solutions exist or that are of a lower dimensionality 
still (Meng and Rubin 1993; Liu and Rubin 1994). 
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Fig. 1. — The Tully-Fisher relation from HST data: The absolute magnitude in five bands 
(B, V, R, I, and H) as a function of the velocity width W is well fit by a power-law for spiral 
galaxies. The relation is found by fitting a single Gaussian distribution to the distribution 
of points in the absolute magnitude-log^o W plane for each bandpass. 
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Fig. 2. — Two-dimensional projections of the three-dimensional velocity distribution of Hip- 
parcos stars using 10 Gaussians and w = A km^ s~^. The top right plot shows 1-sigma 
covariance ellipses around each individual Gaussian in the Vx-Vy plane; the thickness of each 
covariance ellipse is proportional to the natural logarithm of its amplitude aj. In the other 
three panels the density grayscale is linear and contours contain, from the inside outward, 
2, 6, 12, 21, 33, 50, 68, 80, 90, 95, 99, and 99.9 percent of the distribution. 50 percent of 
the distribution is contained within the innermost dark contour. The feature at Vy ^ —100 
km s~^ is real and corresponds to a known feature in the velocity distribution: the Arc- 
turus moving group; Indeed, all the features that appear in these projections are real and 
correspond to known features. 
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Fig. 3. — Convergence of the full algorithm: total log likelihood at each iteration step. 
Shown are only split-and-merge steps that improve the likelihood; each vertical gray line 
corresponds to a point at which split and merge is performed. For clarity's sake, we show in 
black only the parts of the split-and-merge steps at which the likelihood is larger than the 
likelihood right before that split-and-merge procedure; the log likelihoods of the steps in a 
split-and-merge procedure in which the likelihood is still climbing back up to the previous 
maximum in likelihood have been replaced by horizontal gray segments. The |/-axis has been 
cut off for display purposes: The log likelihood of the initial condition was -2.39E5. 



